1. Executive summary¶
- The Udemy development category contains roughly 10,000 courses.
- A few courses have amassed large numbers of subscribers and have estimated revenue of up to $7.5 million (a conservative estimate).
- But most courses don't enjoy the same success. The top 10% of courses by number of subscribers amass more than two thirds of all subscribers.
- It's hard to attribute the success of a course to specific variables, but for the top 10% we see a slight tendency towards more reviews, longer curricula and content length, and higher prices.
- The number of reviews is associated with the number of subscribers only for the top 10% of courses. We hypothesize there's a volume effect: past a certain number of reviews there's enough social proof to attract subscribers.
- Ratings on Udemy are generally high. We have not seen ratings strongly influence course success; as long as they are high enough (e.g. above 4.0), we see a slightly higher tendency to find a course with more subscribers, especially in the top 10%.
- Most courses are in the two lowest of Udemy's price categories ($19.99 to $89.99), but the top 10% of courses tend to be priced higher. Given we only have a price snapshot, and Udemy constantly runs discounts, it's hard to speculate. Our hypothesis is that once prices are discounted into the affordability range, the price drop creates a fear of missing out and drives the buying decision.
- Since 2020 the number of courses launched per year in the development category has decreased. The launch date (age) of a course doesn't seem to play a big role in its success. However, it's less likely to find a course in the top 10% that was launched in the last 2 or 3 years. Perhaps time to amass subscribers is necessary.
- Courses in the top 10% show a slight tendency towards longer content and more comprehensive curricula; the same cannot be observed in other groups. We think these help students evaluate how thorough a course is as they audit courses before buying.
- Most Udemy courses target all levels or beginners. We think that reflects Udemy's positioning; course creators targeting experts should probably also publish their course on other platforms.
- Web Development, Programming Languages, and Data Science are, in that order, the most popular subcategories for course creators and students. Every subcategory contains courses with varying degrees of success, and many outliers. You cannot assume a course will be successful just because it's in a high-demand subcategory.
- A few of the 1343 topics (labels in our data) attract a lot of courses and subscribers. For example, among the top 50 topics, Python, the topic with the highest supply and demand, attracts more than 28M subscribers across 994 courses, while Docker attracted 1.90M subscribers across 106 courses. The top 10% of topics concentrate most subscribers, but once again you can find courses of varying levels of success in each topic, so choosing a high-demand topic doesn't guarantee success.
- Lastly, we see 3843 instructors in the development category, of whom 1590 produced more than one course. 38 instructors attracted more than 1M subscribers, and most of them needed multiple courses to achieve that. Looking at instructor success rate (average subscribers per course launched), only 62 instructors exceed 100k subscribers per course, and very few can sustain a success rate above 30k subscribers per course.
2. Loading the data¶
The scraped data is a 75MB JSON document containing around 10,000 course details. The data has already been transformed and partially cleaned. The transformation scripts generated two datasets:
- courses_numerical_categorical_data.csv contains non-textual fields for a typical exploratory data analysis.
- courses_text_data.csv contains only textual fields for NLP analysis.
As the dataset has already been cleaned and transformed, there are almost no missing values (a single instructional_level entry is null). Some columns are the result of the processing steps, and many fields from the original JSON have been removed.
# Importing charting libraries
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
# Importing seaborn for heatmap/correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt
# This ensures Plotly output works in VS Code and exporting html:
# plotly_mimetype: VS Code notebook UI
# notebook: "Jupyter: Export to HTML" command in VS Code
# See https://plotly.com/python/renderers/#multiple-renderers
pio.renderers.default = "plotly_mimetype+notebook"
# Importing pandas
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format # Setting default display on print to 2 floating points
# Get paths of CSVs produced by transformation scripts
data_folder_path = "../data/"
file_path = data_folder_path + "courses_numerical_categorical_data.csv" # For this analysis we'll focus only on categorical or numerical data
# Read csv into a panda dataframe
courses = pd.read_csv(file_path)
courses.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9969 entries, 0 to 9968
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   udemy_id                9969 non-null   int64
 1   title                   9969 non-null   object
 2   instructors             9969 non-null   object
 3   locale                  9969 non-null   object
 4   created                 9969 non-null   object
 5   num_subscribers         9969 non-null   int64
 6   rating                  9969 non-null   float64
 7   num_reviews             9969 non-null   int64
 8   num_quizzes             9969 non-null   int64
 9   num_lectures            9969 non-null   int64
 10  num_curriculum_items    9969 non-null   int64
 11  category                9969 non-null   object
 12  subcategory             9969 non-null   object
 13  labels                  9969 non-null   object
 14  content_length_minutes  9969 non-null   int64
 15  content_length_hours    9969 non-null   float64
 16  instructional_level     9968 non-null   object
 17  price                   9969 non-null   float64
dtypes: float64(3), int64(7), object(8)
memory usage: 1.4+ MB
Above we see the categorical and numerical fields created by the transformation scripts. For this analysis I preserved title, a text field, to provide context in charts and tables.
3. Understanding the target variable num_subscribers¶
Our analysis target variable is the number of subscribers (num_subscribers). Every Udemy user can subscribe to any number of courses, and a course's success is naturally defined by how many subscribers it has.
3.1 The feeling of success¶
As our goal is to understand how to break into the top courses, let's take a quick look at the top 10 courses to get a feel for them, and estimate their revenue. Looking at the courses below, it's interesting to see that the top 10 courses are very lucrative, with half of them involving Python.
# Calculating potential revenue for each course
# Assuming an 80% discount, because Udemy courses are often on sale; this gives a conservative (minimum) revenue estimate.
# The 0.0000001 factor scales the result to millions
courses['estimated_revenue'] = round(courses['num_subscribers'] * courses['price'] * 0.2 * 0.0000001, 2)
# Get the top 10 after sorting by revenue, keeping only the columns relevant for printing
courses_top_10_revenue = (
courses
.sort_values(by='estimated_revenue', ascending=False)
.head(10)[['subcategory',
'title',
'num_subscribers',
'num_reviews',
'rating',
'price',
'estimated_revenue']]
.rename(columns={'subcategory': 'Subcategory',
'title': 'Title',
'num_subscribers': 'Number Subscribers',
'num_reviews': 'Number of Reviews',
'rating': 'Rating',
'price': 'Price',
'estimated_revenue': 'Estimated Revenue (in Millions)'})
)
courses_top_10_revenue
| | Subcategory | Title | Number Subscribers | Number of Reviews | Rating | Price | Estimated Revenue (in Millions) |
|---|---|---|---|---|---|---|---|
| 0 | Programming Languages | The Complete Python Bootcamp From Zero to Hero... | 1875450 | 498952 | 4.58 | 199.99 | 7.50 |
| 5 | Web Development | The Complete JavaScript Course 2024: From Zero... | 892956 | 202134 | 4.71 | 199.99 | 3.57 |
| 4 | Programming Languages | React - The Complete Guide 2024 (incl. React R... | 843141 | 207190 | 4.63 | 199.99 | 3.37 |
| 1 | Web Development | The Complete 2024 Web Development Bootcamp | 1219060 | 365816 | 4.70 | 119.99 | 2.93 |
| 8 | Data Science | Machine Learning A-Z: AI, Python & R + ChatGPT... | 1037484 | 182664 | 4.53 | 139.99 | 2.90 |
| 3 | Web Development | The Web Developer Bootcamp 2024 | 903652 | 270626 | 4.68 | 149.99 | 2.71 |
| 12 | Programming Languages | Automate the Boring Stuff with Python Programming | 1120833 | 112019 | 4.65 | 119.99 | 2.69 |
| 2 | Programming Languages | 100 Days of Code: The Complete Python Pro Boot... | 1213974 | 282675 | 4.68 | 109.99 | 2.67 |
| 7 | Programming Languages | Java 17 Masterclass: Start Coding in 2024 | 845037 | 194562 | 4.55 | 139.99 | 2.37 |
| 15 | Programming Languages | Learn Python Programming Masterclass | 423910 | 101645 | 4.60 | 199.99 | 1.70 |
3.2 The distribution of number of subscribers per quantile¶
Below we get an overview of selected percentiles of the data (ordered by number of subscribers). We quickly see that a few courses are very successful, while at least up to the third quartile courses enjoy comparatively mild success.
# Getting the percentiles and using pandas describe to get basic measures of central tendency and dispersion
courses['num_subscribers'].describe(percentiles=[0.25, 0.50, 0.75, 0.80, 0.85, 0.90, 0.95, 0.99])
count      9969.00
mean      15308.89
std       50499.90
min          44.00
25%         954.00
50%        3515.00
75%       12297.00
80%       16097.00
85%       21890.20
90%       31173.80
95%       58524.20
99%      206662.60
max     1875450.00
Name: num_subscribers, dtype: float64
Below we immediately see that there are natural outliers in the data. They are not data noise or input errors, but highly successful courses, so we will not remove them from the analysis.
# Sort data frame by number of subscribers
courses.sort_values(by='num_subscribers', ascending=False, inplace=True)
# Create violin plot
fig_dispersion_subscribers = px.violin(
courses,
y="num_subscribers",
points="all",
hover_data=courses.columns
)
# Configure chart labels
fig_dispersion_subscribers.update_layout(
title="Distribution of the number of subscribers across courses",
xaxis_title="Courses",
yaxis_title="Total Number of Subscribers"
)
fig_dispersion_subscribers.show()
If we remove the top 10% (last decile by number of subscribers) we see how the distribution changes.
# Calculate the 90th percentile value for the number of subscribers
threshold = courses['num_subscribers'].quantile(0.9)
# Filter the dataframe based on the 90th percentile threshold
courses_above_90p = courses[courses['num_subscribers'] >= threshold]
courses_below_90p = courses[courses['num_subscribers'] < threshold]
# Create violin plot
fig_dispersion_subscribers = px.violin(
courses_below_90p,
y="num_subscribers",
points="all",
box=True,
hover_data=courses_below_90p.columns
)
# Configure chart labels
fig_dispersion_subscribers.update_layout(
title="Distribution of the number of subscribers across courses (excluding top 10%)",
xaxis_title="Courses",
yaxis_title="Total Number of Subscribers"
)
fig_dispersion_subscribers.show()
If we group the courses into deciles and focus on the earlier ones, we get a better idea of how subscriber dispersion changes in each group. We see higher deviations as we go up the deciles.
# Calculate deciles and add a new column to the dataframe
courses['decile'] = pd.qcut(
courses['num_subscribers'],
q=10,
labels=['Decile 1', 'Decile 2', 'Decile 3', 'Decile 4', 'Decile 5',
'Decile 6', 'Decile 7', 'Decile 8', 'Decile 9', 'Decile 10']
)
# Create a box plot to visualize dispersion across deciles
fig_dispersion_per_decile = px.box(
courses,
x="num_subscribers",
color="decile"
)
# Configure chart labels and focus view
fig_dispersion_per_decile.update_layout(
title_text="Dispersion of Number of Subscribers per Decile",
yaxis_title="Decile",
xaxis_title="Total Number of Subscribers",
xaxis=dict(range=[0, 120000]) # Set the initial x-axis range to focus on lower subscriber counts
)
# Add an annotation to explain the chart is opened in a focused view
fig_dispersion_per_decile.add_annotation(
text="<sup>Opened with Zoom below 120k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_dispersion_per_decile.show()
Another way to visualize the spread of "success" is to visualize the cumulative number of subscribers per decile.
# Create a new dataframe with deciles and total number of subscribers
sum_subscribers_per_decile = (
courses
.groupby('decile', observed=True)['num_subscribers']
.sum()
)
fig_decile_totals = px.bar(
x = sum_subscribers_per_decile.index,
y = sum_subscribers_per_decile.values,
labels={'x': 'Deciles', 'y': 'Total Num Subscribers'}
)
fig_decile_totals.update_layout(title_text="Total Number of Subscribers by Decile")
fig_decile_totals.show()
So now we have a clear picture that success is indeed very concentrated. Let's go into detail to find the success drivers.
4. Why is a course successful?¶
Looking at the data we can outline a few hypotheses for investigation.
- Courses with more reviews will have more subscribers (maybe there's a social proof effect).
- Courses with higher average rating will have more subscribers (sense of value).
- Price category will influence the number of subscribers.
- Time of publication will influence the number of subscribers (it takes time to accumulate success).
- Longer length of curriculum & video lectures will drive up subscribers (Sense of completion).
- Courses for beginners or intermediates will have more subscribers than for experts.
- Number of subscribers will be concentrated in just few subcategories (E.g. Data Science, Web Development, etc).
- Number of subscribers will be concentrated in just a few labels, like Python.
- Number of subscribers will be concentrated in a few instructors (the ones who figured out the success formula, and use one course to cross-promote another).
Before we deep dive into each hypothesis, let's have a look at the correlation matrix between numerical fields to get an overview of these hypotheses.
# Calculate correlation matrix excluding non-numerical fields (using pandas embedded method)
courses_correlation_matrix = courses.corr(numeric_only=True)
# Plot the heatmap for the correlation matrix
plt.figure(figsize=(16, 8))
sns.heatmap(
courses_correlation_matrix,
annot=True,
cmap='coolwarm',
fmt=".2f",
xticklabels=courses_correlation_matrix.columns,
yticklabels=courses_correlation_matrix.columns,
linewidths=.5,
cbar_kws={"shrink": .5}
)
# Add title and rotate the x-axis labels
plt.title('Correlation Matrix Heatmap', fontsize=20)
plt.xticks(rotation=90)
# Highlighting the number of subscribers column (our target variable)
num_subscribers_index = courses_correlation_matrix.columns.get_loc('num_subscribers')
plt.gca().axvline(num_subscribers_index, color='red', linestyle='-', linewidth=4) # Left border
plt.gca().axvline(num_subscribers_index + 1, color='red', linestyle='-', linewidth=4) # Right border
plt.show()
The only strong positive correlation in this heatmap is between the number of reviews and the number of subscribers. That's understandable: Udemy prompts students to review a course early on, so the more students a course has, the more review opportunities there are.
The number of curriculum items (which here includes lectures and quizzes) shows another, much weaker positive correlation with the number of subscribers. Maybe students judge a course's quality by how comprehensive its curriculum is.
What would happen if we looked at the correlation only for the number of subscribers, and saw how it changes across deciles?
# Create a correlation matrix per decile
correlation_matrix_per_decile = (
courses
.drop(columns=['udemy_id', 'estimated_revenue']) # Removing columns that are meaningless or created by us for other purpose
.groupby('decile', observed=True) # Group by decile
.corr(numeric_only=True) # Calculate correlation matrix for each decile
)
# Plot the heatmap
plt.figure(figsize=(16, 8))
sns.heatmap(
correlation_matrix_per_decile.xs('num_subscribers', level=1),
annot=True,
cmap='coolwarm',
fmt=".2f"
)
plt.title('Correlation Matrix by Decile - Only Number of Subscribers', fontsize=20)
plt.show()
We see that, apart from rating, the correlation values jump up in the last decile. While this is not causation, we could roughly say that successful courses (last decile) in general have slightly more comprehensive curricula, longer content, higher prices, and more reviews than courses in other deciles.
We'll dive into each of these topics, but let's discuss the first hypothesis, the number of reviews, now.
4.1 Courses with more reviews will have more subscribers. There's a social proof effect.¶
The number of subscribers is positively correlated with the number of reviews only in the last two deciles, especially the last decile where the correlation is very strong.
In the other deciles, courses with more subscribers don't have correspondingly more reviews.
I wonder if there's a review volume at which a course achieves enough 'social proof' that it can attract subscribers much faster than others.
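The per-decile check behind this observation can be sketched as follows. This is a minimal example on synthetic data (the real notebook groups the courses dataframe by its existing decile column); the 5%-of-subscribers review rate and noise level are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the courses dataframe
rng = np.random.default_rng(42)
n = 2000
subs = rng.integers(50, 100_000, size=n)
# Assumed relationship: reviews are roughly 5% of subscribers plus noise
reviews = np.clip(subs * 0.05 + rng.normal(0, 500, size=n), 0, None)
df = pd.DataFrame({"num_subscribers": subs, "num_reviews": reviews})

# Deciles by number of subscribers, labeled as in the notebook
df["decile"] = pd.qcut(df["num_subscribers"], q=10,
                       labels=[f"Decile {i}" for i in range(1, 11)])

# Pearson correlation between reviews and subscribers within each decile
per_decile = (
    df.groupby("decile", observed=True)[["num_subscribers", "num_reviews"]]
      .apply(lambda g: g["num_subscribers"].corr(g["num_reviews"]))
)
print(per_decile)
```

Within-decile correlations are much weaker than the global one, because grouping by decile restricts the subscriber range each correlation is computed over.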
4.2 Courses with higher average rating will have more subscribers.¶
Let's look at the distribution of ratings first to see what 'bad' and 'good' rating means at Udemy.
courses['rating'].describe()
count   9969.00
mean       4.25
std        0.44
min        1.44
25%        4.01
50%        4.34
75%        4.56
max        5.00
Name: rating, dtype: float64
As suspected, we see above that ratings on Udemy are relatively high across all quartiles.
To take a deeper look, let's check ratings per decile again (as a reminder, deciles are ordered by number of subscribers), looking at the mean and median of the normalized ratings per decile.
# Normalizing ratings
min_rate = courses['rating'].min()
max_rate = courses['rating'].max()
# Create a copy and add new column with normalized ratings
courses_with_normalized_rating = courses.copy()
courses_with_normalized_rating['rating_normalized'] = round((courses_with_normalized_rating['rating'] - min_rate) / (max_rate - min_rate), 2) # Normalization from 0 to 1, and keeping 2 decimals
# Calculate mean and median rating
decile_rating = (
courses_with_normalized_rating
.groupby('decile', observed=True)['rating_normalized']
.agg(['mean', 'median', 'count'])
)
decile_rating
| mean | median | count | |
|---|---|---|---|
| decile | |||
| Decile 1 | 0.76 | 0.78 | 997 |
| Decile 2 | 0.76 | 0.78 | 997 |
| Decile 3 | 0.78 | 0.80 | 997 |
| Decile 4 | 0.79 | 0.81 | 997 |
| Decile 5 | 0.79 | 0.81 | 998 |
| Decile 6 | 0.80 | 0.82 | 995 |
| Decile 7 | 0.80 | 0.83 | 997 |
| Decile 8 | 0.80 | 0.82 | 997 |
| Decile 9 | 0.80 | 0.83 | 997 |
| Decile 10 | 0.82 | 0.85 | 997 |
We see above that the average rating does not increase significantly from decile to decile. We can also visualize this relationship below, where we see that the last decile has a slightly higher tendency towards higher ratings.
# Scatterplot of number of subscribers vs rating
fig_subscribers_ratings = px.scatter(
courses,
y = "num_subscribers",
x = "rating",
trendline= "ols",
color= "decile",
hover_data=courses.columns
)
# Configure chart labels and focus view
fig_subscribers_ratings.update_layout(
title_text="Number of Subscribers vs. Rating",
xaxis_title="Rating",
yaxis_title="Number of Subscribers",
yaxis=dict(range=[0, 300000]) # Set the initial y-axis range to focus on lower deciles
)
# Add an annotation to explain the chart is opened in a focused view
fig_subscribers_ratings.add_annotation(
text="<sup>Opened with Zoom below 300k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_subscribers_ratings.show()
In conclusion, ratings play a minor role in the number of subscribers, as expected from the correlation matrix, though we see a tendency towards higher ratings in the last decile.
However, without a causal experiment, it's hard to measure consumers' sensitivity to small variations in rating. I'd hypothesize that consumers don't make their buying choice, consciously or unconsciously, between two courses separated by one or two tenths of a rating point (often the average difference between deciles).
Maybe the insight is that Udemy's average ratings are high across the board for a good reason: the platform likely prompts users to rate a course at the moment most likely to maximize the rating.
The lesson outside such platforms is that maintaining high ratings, along with achieving a high number of reviews, could create a powerful social proof that spirals into success. After all, why would a course with 10000 reviews and an average rating of 4.3 not be the best choice for me?
4.3 Price category will influence the number of subscribers¶
We defined 5 price categories of equal width (using the pd.cut function); each course is then assigned a price category.
# Creating bins for each price category
courses_with_price_category = courses.copy()
courses_with_price_category['price_category'] = pd.cut(
courses_with_price_category['price'],
bins=5,
labels=['$', '$$', '$$$', '$$$$', '$$$$$']
)
# Calculate the stats for each price category
price_category_summary = (
courses_with_price_category
.groupby('price_category', observed=True)['price']
.agg(['min', 'max', 'mean', 'median'])
)
# Show number of courses_with_price_category in each price category
print(price_category_summary) # Print ranges
courses_with_price_category['price_category'].value_counts() # Print stats
                  min     max    mean  median
price_category
$               19.99   54.99   38.62   39.99
$$              59.99   89.99   72.49   69.99
$$$             94.99  124.99  103.69   99.99
$$$$           129.99  159.99  142.46  139.99
$$$$$          174.99  199.99  191.78  199.99
price_category
$        4978
$$       4100
$$$       759
$$$$       93
$$$$$      39
Name: count, dtype: int64
We can make a few observations from the output above:
- Most courses are in the two lowest price categories.
- The price ranges of each category reveal Udemy's pricing tiers, which could help course creators price their product.
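As an illustration, a hypothetical new course price could be mapped onto these tiers using pd.cut with explicit equal-width bin edges. The edges below are assumptions derived from the observed $19.99–$199.99 range in this dataset, not Udemy's official tiers, and the prices are invented:

```python
import pandas as pd

# Five equal-width bins over the observed price range 19.99–199.99 (width 36.00)
edges = [19.98, 55.99, 91.99, 127.99, 163.99, 199.99]
labels = ['$', '$$', '$$$', '$$$$', '$$$$$']

new_prices = pd.Series([24.99, 94.99, 189.99])  # hypothetical course prices
categories = pd.cut(new_prices, bins=edges, labels=labels)
print(list(categories))
```

This mirrors how the notebook's own pd.cut(..., bins=5) call partitions the full price column, but with the edges pinned so a single new price can be classified consistently.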
Let's now understand the course success across all price ranges.
# Create scatter plot
fig_price_subscribers = px.scatter(
courses,
y="num_subscribers",
x="price",
trendline="ols",
color="decile",
hover_data=courses.columns
)
# Configure chart labels and focus view
fig_price_subscribers.update_layout(
title_text="Number of Subscribers vs. Price",
xaxis_title="Price",
yaxis_title="Number of Subscribers",
yaxis=dict(range=[0, 300000]) # Set the initial y-axis range to focus on lower deciles
)
# Add an annotation to explain the chart is opened in a focused view
fig_price_subscribers.add_annotation(
text="<sup>Opened with Zoom below 300k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_price_subscribers.show()
We see here again that in the last decile there's a slight tendency for higher prices to be associated with a higher number of subscribers.
Unfortunately, we only have a price snapshot, so it's hard to say whether prices were adjusted after a course reached a certain number of subscribers.
Udemy constantly runs sales, so the original prices are not fully reliable; however, discounts are normally proportional to the original price and applied to all courses.
My hypothesis is that as soon as discounted prices enter the consumer's affordability range, the absolute discount might drastically influence the buying decision, so courses with higher original prices might benefit more from the sales Udemy constantly runs.
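To make the absolute-discount intuition concrete (toy arithmetic, not data from this notebook), the same 80%-off sale produces a much larger visible saving on a higher-priced course:

```python
# At the same 80%-off sale, the absolute saving scales with the original price
for original in (199.99, 89.99, 39.99):
    sale = round(original * 0.2, 2)
    print(f"${original:.2f} -> ${sale:.2f} (saving ${original - sale:.2f})")
```

A $199.99 course shows a $159.99 saving on sale, versus $31.99 for a $39.99 course, even though both end up affordable.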
My advice for course creators would be to price their course in the middle-to-upper range, and consider raising the price after gaining enough momentum (subscribers).
4.4 Time of publication will influence the number of subscribers¶
First, let's have a look at the number of courses launched per year, and the cumulative value.
# Prepare a new dataframe with created year and sorted by year
courses_with_created_year = courses.copy()
courses_with_created_year['created_year'] = (
pd.to_datetime(courses_with_created_year['created']).dt.year
)
# Sort by created date to create line plot
df_line_trace = courses_with_created_year.sort_values(by='created', ascending=True)
df_line_trace.reset_index(drop=True, inplace=True) # Reset index and drop the old index
# Plot line with course 'index' over the time
line_trace = go.Scatter(
x=df_line_trace['created'],
y=df_line_trace.index,
name="Cumulative courses created"
)
# Counting the values per year
df_bar_trace = (
courses_with_created_year['created_year']
.value_counts()
.reset_index()
)
# Plotting the number of courses per year
bar_trace = go.Bar(
y=df_bar_trace['count'],
x=df_bar_trace['created_year'],
name="Courses created per year"
)
fig = go.Figure(data=[line_trace, bar_trace])
fig.update_xaxes(tickvals=courses_with_created_year['created_year'], tickformat="%Y")
fig.update_layout(title_text="Courses over time")
fig.show()
Surprisingly, there seems to be a reduction in courses launched in this category ('development') since 2020. Is this worrisome for Udemy, or a general trend in the market?
Let's now look at the number of subscribers per year. First only for courses below the 90th percentile.
courses_with_created_year_90p = courses_with_created_year[~courses_with_created_year['decile'].isin(['Decile 10'])] # Remove the last decile
# Plotting box chart
fig_subscribers_year = px.box(
courses_with_created_year_90p,
x='created_year',
y='num_subscribers',
hover_data=courses_with_created_year_90p.columns
)
fig_subscribers_year.update_xaxes(
tickvals=courses_with_created_year_90p['created_year'],
tickformat="%Y"
)
fig_subscribers_year.update_layout(title_text="Number of subscribers by year (excluding top 10%)")
# Show the box plot
fig_subscribers_year.show()
It seems that for the majority of courses below the 90th percentile, the creation date does not significantly influence the number of subscribers. Maybe for courses that were not a top success in their first year(s), time alone will not push them into the top group.
Now let's look at the top 10% courses.
# Keep only last decile
courses_created_year_d10 = courses_with_created_year[courses_with_created_year['decile'].isin(['Decile 10'])]
# Create box chart
fig_subscribers_years = px.box(
courses_created_year_d10,
x='created_year',
y='num_subscribers',
hover_data=courses_created_year_d10.columns
)
# Configure chart labels and focus view
fig_subscribers_years.update_layout(
title_text="Number of subscribers by year (Top 10%)",
xaxis_title="Year",
yaxis_title="Number of subscribers",
yaxis=dict(range=[20000, 300000]) # Set the initial y-axis range to focus between 20k and 300k subscribers
)
fig_subscribers_years.update_xaxes(tickvals=courses_created_year_d10['created_year'], tickformat="%Y")
# Add an annotation to explain the chart is opened in a focused view
fig_subscribers_years.add_annotation(
text="<sup>Opened with Zoom below 300k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
# Show the box plot
fig_subscribers_years.show()
For courses published 3+ years ago, we more often see higher subscriber counts.
My hypothesis is that accumulating that many subscribers takes time. For example, the most successful course, with 1.87M subscribers, was launched in 2015. The challenge then is to keep the course relevant over time.
My recommendation for course creators is to be patient: there are fewer courses in the top 10% created in 2022-2023 than in 2019-2021.
4.5 Length of curriculum & video lectures will drive up subscribers.¶
We have seen a moderate correlation between num_curriculum_items and num_subscribers in the last decile, and similarly with content_length_hours. Let's look at it more closely.
fig_subscribers_curriculumitems = px.scatter(
courses,
x="num_curriculum_items",
y="num_subscribers",
trendline="ols",
color="decile",
hover_data=courses.columns
)
fig_subscribers_curriculumitems.update_layout(
height=600,
title_text="Number of Subscribers vs Curriculum Items",
yaxis=dict(range=[0, 1000000]) # Set the initial y-axis range to make the visual less affected by outliers
)
# Add an annotation to explain the chart is opened in a focused view
fig_subscribers_curriculumitems.add_annotation(
text="<sup>Opened with Zoom below 1M subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_subscribers_curriculumitems.show()
As you can see, for the first 9 deciles the number of curriculum items has little impact on the number of subscribers. In the last decile, there's a higher tendency for courses to have more curriculum items.
My hypothesis is that consumers judge a course's content partly by its outlined curriculum, since they can preview the outline but can't audit all the content before buying. More curriculum items may also give a sense that a course is more comprehensive, reducing the fear of missing out on important content.
My recommendation for course creators would be to create an extensive curriculum with many items.
Let's do the same analysis with the length of the course (hours of video).
fig_scatter_content_length = px.scatter(
courses,
x="content_length_hours",
y="num_subscribers",
trendline="ols",
color="decile",
hover_data=courses.columns
)
fig_scatter_content_length.update_layout(
height=600,
title_text="Number of Subscribers vs Content length (hours)",
yaxis=dict(range=[0, 1000000]) # Set the initial y-axis range to make the visual less affected by outliers
)
# Add an annotation to explain the chart is opened in a focused view
fig_scatter_content_length.add_annotation(
text="<sup>Opened with Zoom below 1M subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_scatter_content_length.show()
No surprises in the picture above, given the very strong positive correlation between content length and number of curriculum items (more curriculum items = more video lectures). The trend is consistent with the previous scatter plot: increased content length is associated with more subscribers only in the last decile.
My recommendation to a course creator is to ensure the course has a comprehensive curriculum, and enough content to match it.
4.6 Courses for beginners or intermediates will have more subscribers than for experts.¶
print(courses['instructional_level'].value_counts())
instructional_level
All Levels            5265
Beginner Level        3164
Intermediate Level    1411
Expert Level           128
Name: count, dtype: int64
The vast majority of courses are aimed at all levels or beginners. That generally makes sense, as higher levels of instruction have a smaller addressable market.
Let's plot the sum of subscribers by instruction level.
fig_subscribers_instructionlevel = px.histogram(
courses,
x='instructional_level',
y='num_subscribers',
hover_data=courses.columns,
color='decile',
barmode="group",
title="Sum of Subscribers by Instructional Level",
labels={'instructional_level': 'Instructional Level', 'num_subscribers': 'Subscribers'}
)
fig_subscribers_instructionlevel.show()
We can see above that at each level (All Levels, Beginner Level, Intermediate Level, Expert Level) the sum of subscribers per decile behaves similarly.
My hypothesis is that this reflects Udemy's audience: the platform probably doesn't attract consumers who already have a degree of experience in the topic they want to learn. Maybe the best advice for course creators in this situation is to market their course on platforms better targeted at intermediates and experts.
4.7 Number of subscribers will be concentrated in just few subcategories (E.g. Data Science, Web Development, etc).¶
courses_aggregated_subcategories = (
courses
.groupby('subcategory')['num_subscribers']
.agg(['min', 'max', 'mean', 'median', 'sum', 'count'])
.reset_index()
)
Let's quickly get an overview of the subcategories inside the development category.
fig_courses_per_subcategory = px.histogram(
courses_aggregated_subcategories.sort_values(by='count', ascending=False),
x='subcategory',
y='count',
title="Courses per subcategory",
labels={'subcategory': 'Subcategory', 'count': 'Courses'}
)
fig_courses_per_subcategory.show()
We see that the subcategories with the largest number of courses are Web Development and Programming Languages.
Visualizing the total number of subscribers in each subcategory, we see a similar picture.
fig_subscribers_subcategory = px.histogram(
courses_aggregated_subcategories.sort_values(by='count', ascending=False),
x='subcategory',
y='sum',
title="Total subscribers per subcategory",
labels={'subcategory': 'Subcategory', 'sum': 'Subscribers'}
)
fig_subscribers_subcategory.show()
Lastly, let's see the dispersion of course success per subcategory.
# Order of subcategories based on median to use in the chart
median_values = (
courses
.groupby('subcategory')['num_subscribers'] # Group by subcategory
.median() # Get the median for subcategory
.sort_values() # Sort median values
.index # Get the indexes of the sorted subcategories to use in the order of the subcategories
)
fig_subscribers_subcategory = px.box(
courses,
hover_data=courses.columns,
x="subcategory",
y="num_subscribers",
title="Number of Subscribers per Subcategory",
category_orders={'subcategory': median_values},
labels={'subcategory': 'Subcategory', 'num_subscribers': 'Subscribers'}
)
fig_subscribers_subcategory.add_annotation(
text="<sup>Opened with Zoom below 50k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_subscribers_subcategory.update_layout(
height=800,
yaxis=dict(range=[0, 50000]), # Set the initial y-axis range to make the visual less affected by outliers
)
fig_subscribers_subcategory.show()
The IT Certifications mean is high, but there are only 8 courses in this subcategory. In most subcategories, the average course is not really 'successful'; disproportionate winners are pulling up each subcategory's box.
We cannot recommend a winning subcategory to course creators. There are multiple top courses in every subcategory, and the majority of courses (up to the third quartile) don't compare well with the most successful ones. However, 'Web Development', 'Data Science' and 'Programming Languages' are the most popular subcategories for both course creators and students.
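To quantify how concentrated subscribers are across subcategories, one could compute the share held by the top few. A sketch with purely illustrative totals (not the real figures from this dataset):

```python
import pandas as pd

# Purely illustrative subscriber totals per subcategory
totals = pd.Series({
    'Web Development': 5_000_000,
    'Data Science': 3_000_000,
    'Programming Languages': 2_000_000,
    'Game Development': 500_000,
    'Databases': 250_000,
})

# Share of all subscribers captured by the three largest subcategories
top3_share = totals.sort_values(ascending=False).head(3).sum() / totals.sum()
print(f"Top 3 subcategories hold {top3_share:.0%} of subscribers")
```

The same one-liner applied to `courses_aggregated_subcategories['sum']` would give the real concentration figure.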
4.8 Number of subscribers will be concentrated in just a few labels like Python.¶
from ast import literal_eval
# Create a working copy of the dataframe
courses_exploded_labels = courses.copy()
# Ensure labels are treated as a list (not strings)
courses_exploded_labels['labels'] = courses_exploded_labels['labels'].apply(literal_eval)
# Explode the rows (one row per label in course)
courses_exploded_labels = courses_exploded_labels.explode('labels')
# Fill in missing labels (courses without labels become NaN after explode) with 'No Label'
courses_exploded_labels['labels'] = courses_exploded_labels['labels'].fillna('No Label')
# Sort the rows based on label name
courses_exploded_labels.sort_values(by='labels', ascending=True, inplace=True)
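The parse-then-explode step above can be illustrated on a toy frame (hypothetical course names and labels): the `labels` column arrives as string representations of Python lists, so it is parsed with `literal_eval` before `explode` creates one row per label.

```python
from ast import literal_eval
import pandas as pd

# Labels arrive as string representations of Python lists
toy = pd.DataFrame({
    'course': ['A', 'B'],
    'labels': ["['Python', 'Pandas']", "['JavaScript']"],
})
toy['labels'] = toy['labels'].apply(literal_eval)  # "[...]" str -> list
toy = toy.explode('labels')                        # one row per label
print(toy)
```

Course A now appears twice, once per label, so per-label aggregations become simple groupbys.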
A quick look at the data shows how many labels there are and how often they appear.
top50_label_count = (
courses_exploded_labels
.groupby('labels')['num_subscribers']
.agg(['min', 'max', 'mean', 'median', 'sum', 'count'])
.reset_index()
.sort_values(by='count', ascending=False)
.head(50)
)
fig_top50_labels = px.histogram(
top50_label_count,
hover_data=top50_label_count.columns,
x="labels",
y="count",
title="Top 50 Labels by number of courses",
labels={'labels': 'Label name', 'count': 'Courses'}
)
fig_top50_labels.update_xaxes(
tickvals=top50_label_count['labels'], tickfont=dict(size=10)
)
fig_top50_labels.update_layout(
height=600
)
fig_top50_labels.show()
There are 1343 labels in total. Python appears most often, followed by Javascript.
Now let's look at the total number of subscribers each label has.
top50_label_subscribers = (
courses_exploded_labels
.groupby('labels')['num_subscribers']
.agg(['min', 'max', 'mean', 'median', 'sum', 'count'])
.reset_index()
.sort_values(by='sum', ascending=False)
.head(50)
)
fig_label_subscribers = px.histogram(
top50_label_subscribers,
hover_data=top50_label_subscribers.columns,
x="labels",
y="sum",
title="Top 50 labels by Number of subscribers",
labels={'labels': 'Label name', 'sum': 'Subscribers'}
)
fig_label_subscribers.update_xaxes(tickvals=top50_label_subscribers['labels'], tickfont=dict(size=10))
fig_label_subscribers.update_layout(
height=600
)
fig_label_subscribers.show()
We see that, in fact, a few labels like Python and Javascript have accumulated a lot of subscribers compared to other labels.
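A simple way to quantify this concentration is a cumulative-share calculation: how many labels are needed to cover, say, 80% of all subscribers. A sketch with made-up per-label totals (the label names are real, the numbers are not):

```python
import pandas as pd

# Made-up per-label subscriber totals, for illustration only
label_totals = pd.Series(
    {'Python': 800, 'JavaScript': 600, 'React': 200, 'Rust': 100, 'Perl': 50}
)

# Cumulative share, walking labels from largest to smallest
cum_share = label_totals.sort_values(ascending=False).cumsum() / label_totals.sum()
# Number of labels needed to reach 80% of all subscribers
labels_for_80pct = int((cum_share < 0.80).sum()) + 1
print(labels_for_80pct)  # 2 of 5 labels cover 80% in this toy example
```

Applied to the real per-label sums, this would tell us how few of the 1343 labels account for most subscribers.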
The chart above contains only the top 50 labels. Let's visualize all labels with the ability to filter by decile.
# Create a new dataframe for plotting labels
labels_with_deciles = (
courses_exploded_labels
.groupby('labels')['num_subscribers'] # Group by label and get number of subscribers
.agg(['sum', 'count']) # Then create sum and count columns
.sort_values(by='sum', ascending=False) # Sort final dataframe by the sum per label
.reset_index() # Reset index to not use label names as index
)
# Assign a decile for each label
labels_with_deciles['decile'] = pd.qcut(
labels_with_deciles['sum'],
10,
labels=['Decile 1', 'Decile 2', 'Decile 3', 'Decile 4', 'Decile 5',
'Decile 6', 'Decile 7', 'Decile 8', 'Decile 9', 'Decile 10']
)
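For reference, `pd.qcut` ranks values into equal-count bins; a toy sketch below. Note that with heavily tied data it can raise on duplicate bin edges, in which case `duplicates='drop'` is needed (and the `labels` list must then match the reduced bin count).

```python
import pandas as pd

# Quartiles on toy data: 8 distinct values -> 2 per bin
s = pd.Series([1, 5, 20, 100, 400, 900, 2000, 5000])
quartile = pd.qcut(s, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quartile.value_counts())
```

Unlike `pd.cut`, which splits the value range evenly, `qcut` splits by rank, so each decile above holds roughly the same number of labels.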
fig_scatter_labels_with_decile = px.scatter(
labels_with_deciles,
y="decile",
x="sum",
color="decile",
hover_data=labels_with_deciles.columns
)
fig_scatter_labels_with_decile.add_annotation(
text="<sup>Opened with Zoom below 800k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_scatter_labels_with_decile.update_layout(
height=600,
title_text="Number of Subscribers per Label (All deciles)",
xaxis=dict(range=[0, 800000]) # Set the initial x-axis range to make the visual less affected by outliers
)
fig_scatter_labels_with_decile.show()
From the scatter plot above we can see the range of label success at each decile. It's interesting to see how the range widens at higher deciles.
Lastly, focusing only on the top 20 labels (by number of subscribers), let's see how course success is spread for each label.
# Get top 20 labels by number of subscribers
top20_labels = (
courses_exploded_labels
.groupby('labels')['num_subscribers'] # Group by label and get number of subscribers
.agg(['sum']) # Then create sum
.sort_values(by='sum', ascending=False) # Sort final dataframe by the sum per label
.head(20) # Get top 20
.reset_index() # Reset index to not use label names as index
)
# Filter all courses with these labels
courses_with_top20_label = courses_exploded_labels[courses_exploded_labels['labels'].isin(top20_labels.labels)]
# Get median of number of subscribers for the top 20 labels
median_values = (
courses_with_top20_label
.groupby('labels')['num_subscribers'] # Group courses by labels and get the number of subscribers for the group
.median() # Get the median of the group
.sort_values(ascending=False) # Sort in descending order
.index # Get the labels
)
fig_top20_labels = px.box(
courses_with_top20_label,
hover_data=courses_with_top20_label.columns,
x="labels",
y="num_subscribers",
category_orders={'labels': median_values},
title="Top 20 labels dispersion of course success",
labels={'labels': 'Label name', 'num_subscribers': 'Number of Subscribers'}
)
fig_top20_labels.add_annotation(
text="<sup>Opened with Zoom below 60k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.05,
showarrow=False,
)
fig_top20_labels.update_layout(
height=600,
yaxis=dict(range=[0, 60000]), # Set the initial y-axis range to make the visual less affected by outliers
)
fig_top20_labels.show()
We see that for each label in the top 20 there are courses with widely varying numbers of subscribers, and a decent amount of outliers. So choosing a popular label won't guarantee course success. That said, a few labels have more outliers, and we can only speculate about the reasons. Maybe these labels (e.g. Python) have more courses and therefore a higher chance of producing a great course, or maybe there's simply more demand for these topics.
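The outliers flagged in the box plot follow the usual 1.5×IQR fence. A minimal sketch of that rule on toy subscriber counts (illustrative numbers, not taken from the dataset):

```python
import pandas as pd

# Toy subscriber counts for one label; one clear outlier
subs = pd.Series([100, 200, 300, 400, 500, 600, 700, 50_000])

q1, q3 = subs.quantile(0.25), subs.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # standard box-plot fence
n_outliers = int((subs > upper_fence).sum())
print(n_outliers)  # 1
```

Counting per-label outliers this way would make the "a few labels have more outliers" observation precise instead of visual.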
Naturally, a course creator tends to be a specialist in a given topic. But for companies that produce courses on multiple topics, I'd consider launching more courses on the topics that attract more subscribers. Alternatively, use additional market demand data (for example, trends and current volumes of Google searches) to find topics that are underserved on Udemy.
4.9 Number of subscribers will be concentrated in a few instructors¶
Is there a success formula that a few instructors figured out? Does having many courses help cross-promote to students? We'll look at these questions.
from ast import literal_eval
# Create a working copy of the dataframe
courses_exploded_instructors = courses.copy()
# Ensure instructors are treated as lists not strings
courses_exploded_instructors['instructors'] = courses_exploded_instructors['instructors'].apply(literal_eval)
# Explode the courses list with one instructor per row
courses_exploded_instructors = courses_exploded_instructors.explode('instructors')
# Get the instructor list with count per instructor
courses_exploded_instructors['instructors'].value_counts()
instructors
Packt Publishing 205
Bluelime Learning Solutions 171
OAK Academy Team 129
Oak Academy 128
Laurence Svekis 124
...
Jamie Henry 1
Venkatesh Chandra 1
Scott Bromander 1
Cstech Training 1
Bernard Martin 1
Name: count, Length: 3849, dtype: int64
As we see, there are 3849 instructors; a few of them are companies that produce multiple courses and also distribute on Udemy (e.g. Packt Publishing). So 3849 instructors produced the 9969 courses.
Below we see that 1591 instructors produced more than 1 course.
instructors_count = courses_exploded_instructors['instructors'].value_counts().reset_index()
instructors_count_morethan1course = instructors_count[instructors_count['count'] > 1]
instructors_count_morethan1course
|   | instructors | count |
|---|---|---|
| 0 | Packt Publishing | 205 |
| 1 | Bluelime Learning Solutions | 171 |
| 2 | OAK Academy Team | 129 |
| 3 | Oak Academy | 128 |
| 4 | Laurence Svekis | 124 |
| ... | ... | ... |
| 1586 | Mehdi Haghgoo | 2 |
| 1587 | Data Science Lovers | 2 |
| 1588 | Amer Sharaf | 2 |
| 1589 | Sanjay Kumar | 2 |
| 1590 | Mostafa Mahmoud | 2 |
1591 rows × 2 columns
Let's take a look at the instructors that have more than 1M subscribers; we'll define them as the top instructors.
# Organize instructors list for visualization
instructors_stats_summary = (courses_exploded_instructors.pivot_table(
index='instructors', # Group by instructor
values=['udemy_id', 'num_subscribers'], # Keep these 2 columns
aggfunc={
'udemy_id': 'count', # Count courses
'num_subscribers': 'sum' # Sum subscribers
},
))
# Reorganize columns for visualization
instructors_stats_summary.reset_index(inplace=True)
instructors_stats_summary.rename(columns={'udemy_id': 'num_courses'}, inplace=True)
# Get instructors with more than 1M subscribers then sort by number of subscribers.
instructors_1M_subscribers = (
instructors_stats_summary[instructors_stats_summary['num_subscribers'] > 1000000]
.sort_values(by='num_subscribers', ascending=False)
)
# Chart results
fig_instructors_subscribers = px.histogram(
instructors_1M_subscribers,
hover_data=instructors_1M_subscribers.columns,
x="instructors",
y="num_subscribers",
title="Instructors with more than 1M subscribers",
labels={'instructors': 'Instructor name', 'num_subscribers': 'Number of Subscribers'}
)
fig_instructors_subscribers.update_xaxes(tickfont=dict(size=10))
fig_instructors_subscribers.update_layout(
height=600
)
fig_instructors_subscribers.show()
I found small inconsistencies between the figures calculated from this dataset and those reported by Udemy, or self-reported by some instructors. Nonetheless, the data is directionally right. For example: You Accel, Max Schwarzmuller, Jose Portilla.
There are 38 instructors out of the 3849 that have accumulated more than 1M subscribers. How many courses have these super successful instructors created? I'm interested to see if these instructors achieved success with one big hit or with multiple courses.
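One possible metric for "one big hit vs many courses" is the share of an instructor's subscribers that comes from their single biggest course. A sketch with hypothetical instructors and numbers:

```python
import pandas as pd

# Hypothetical per-course subscriber counts for two instructors
toy = pd.DataFrame({
    'instructors': ['A', 'A', 'A', 'B', 'B'],
    'num_subscribers': [900_000, 80_000, 20_000, 400_000, 600_000],
})

# Fraction of each instructor's subscribers from their biggest course
hit_share = (
    toy.groupby('instructors')['num_subscribers'].max()
    / toy.groupby('instructors')['num_subscribers'].sum()
)
print(hit_share)  # A is hit-driven (0.9), B is more balanced (0.6)
```

A value near 1.0 means one hit carries the instructor; values well below suggest subscribers accumulated across a catalog.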
fig_instructors_courses = px.histogram(
instructors_1M_subscribers.sort_values(by='num_courses', ascending=False),
hover_data=instructors_1M_subscribers.columns,
x="instructors",
y="num_courses",
title="Number of courses created by Instructors with more than 1M subscribers",
labels={'instructors': 'Instructor name', 'num_courses': 'Courses'}
)
fig_instructors_courses.update_xaxes(tickfont=dict(size=10))
fig_instructors_courses.update_layout(height=600)
fig_instructors_courses.show()
I found small inconsistencies between the calculation using this dataset and what is reported by Udemy. It seems free courses and translations (duplications) of courses were not included in this dataset. However, the differences are small, and the data is directionally right.
As you can see, most of them have created multiple courses to achieve their success. But do they achieve success with one single course, or do they accumulate subscribers from multiple courses? Let's take a closer look at the courses from these successful instructors and see which decile (based on number of subscribers) each one is in.
# Filter all courses for the instructors with 1M+ subscribers
courses_with_top_instructors = courses_exploded_instructors[courses_exploded_instructors['instructors'].isin(instructors_1M_subscribers['instructors'])]
# Get median of number of subscribers for the top instructors
instructors_orderedby_subscribers = (
courses_with_top_instructors
.groupby('instructors')['num_subscribers'] # Group courses by instructor and get the number of subscribers for the group
.sum() # Get the sum of the group
.sort_values(ascending=False) # Sort in descending order
.index # Get the Instructors
)
# Chart results
fig_top_instructors = px.scatter(
courses_with_top_instructors,
hover_data=courses_with_top_instructors.columns,
x="instructors",
y="num_subscribers",
color="decile",
category_orders={'instructors': instructors_orderedby_subscribers}, # Order from instructors with most subscribers
title="Top instructors course hits",
labels={'instructors': 'Instructor name (Ordered by total number of subscribers)', 'num_subscribers': 'Number of Subscribers'}
)
fig_top_instructors.add_annotation(
text="<sup>Opened with Zoom below 100k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.1,
showarrow=False,
)
fig_top_instructors.update_layout(yaxis=dict(range=[0, 100000])) # Limiting view range for the chart on open to make it better to visualize
fig_top_instructors.update_layout(height=600)
fig_top_instructors.show()
As you can see above, even the most successful instructors launched plenty of 'unsuccessful' courses. Another way to evaluate the success of an instructor is to calculate how 'efficient' their content is at acquiring subscribers. For that, we can calculate a new variable: subscribers per course produced. Let's include in this analysis instructors with a success rate above 100k subscribers per course created.
instructors_stats_summary['success_rate'] = instructors_stats_summary['num_subscribers'] / instructors_stats_summary['num_courses']
# Get instructors with more than 100k subscribers per course produced
instructors_100k_successrate = (
instructors_stats_summary[instructors_stats_summary['success_rate'] > 100000] # Filter instructors based on success rate
.sort_values(by='success_rate', ascending=False) # Sort by success rate
)
# Create chart
fig_instructors_success_rate = px.bar(
instructors_100k_successrate,
hover_data=instructors_100k_successrate.columns,
x="instructors",
y="success_rate",
title="Instructors with more than 100k subscribers per course created",
labels={'instructors': 'Instructor name', 'success_rate': 'Success rate'}
)
fig_instructors_success_rate.update_xaxes(tickfont=dict(size=10))
fig_instructors_success_rate.update_layout(height=600)
fig_instructors_success_rate.show()
We see that only 62 instructors achieved a success rate of more than 100k subscribers per course created. That becomes clear in the histogram below, which shows the count of instructors per success rate group. A success rate above 50k subscribers per course is rare.
# Chart success rate in bins
fig_instructors_success_rate = px.histogram(
instructors_stats_summary,
x="success_rate",
nbins=100,
title="Count of Instructors per success rate group",
labels={'success_rate': 'Success rate'}
)
fig_instructors_success_rate.add_annotation(
text="<sup>Opened with Zoom on x-axis below 150k subscribers. Use autoscale to view all data. </sup> ",
xref="paper", yref="paper",
x=1, y=1.1,
showarrow=False,
)
fig_instructors_success_rate.update_layout(xaxis=dict(range=[0, 150000]))
fig_instructors_success_rate.show()
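To back the "rare above 50k" reading with a number rather than a visual, one could compute the fraction of instructors over that threshold directly. A sketch with made-up success rates:

```python
import pandas as pd

# Made-up success rates (subscribers per course) for eight instructors
rates = pd.Series([500, 1_200, 3_000, 8_000, 15_000, 40_000, 60_000, 120_000])

# Fraction of instructors above 50k subscribers per course
frac_above_50k = (rates > 50_000).mean()
print(frac_above_50k)  # 0.25 in this toy example
```

Run against `instructors_stats_summary['success_rate']`, this would give the real share, which the histogram suggests is a small fraction of the 3849 instructors.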
Ultimately, the recommendation for an instructor is to not stop at the first sign of failure, nor of success. Few instructors sustain a high success rate across multiple courses, but we have also seen that the most successful instructors, those with more than 1M subscribers, launched courses with varying degrees of success on their way there.